The file ‘WeatherDownloads_202005_v002.Rmd’ contains code for dowloading and processing historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been dowloaded and processed for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform exploratory data analysis on the processed weather files.
Each processed data file contains one year of hourly weather data for one station. Files are saved as ‘./RInputFiles/ProcessedMETAR/metar_kxxx_yyyy.rds’ where xxx is the three-digit airport code and yyyy is the four-digit year.
Each file contains the following variables:
There are several functions available for analysis:
plotCountsByMetric() - bar plots for counts by variable
plotNumCor() - plot two numeric variables against each other
plotFactorNumeric() - boxplot a numeric variable against a factor variable
corMETAR() - correlations between METAR variables
lmMETAR() - linear regression modeling for METAR variables
basicWindPlots() - plot wind speed and direction
getWindDirGroup() - convert wind direction to a grouping (e.g., N for 320-360-40)
consolidatePlotWind() - show frequency plots of wind direction, city, and month
The tidyverse library is loaded and the 2016 Detroit data is read in to show examples of the functions:
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
kdtw_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kdtw_2016.rds")
A variable mapping is created to allow for better readable variable names:
varMapper <- c(WindDir="Wind Direction (degrees)",
predomDir="General Prevailing Wind Direction",
WindSpeed="Wind Speed (kts)",
WindSpeed5="Wind Speed (kts), rounded to nearest 5 knots",
Visibility="Visibility (SM)",
TempC="Temperature (C)",
DewC="Dew Point (C)",
Altimeter="Altimeter (inches Hg)",
Altimeter10="Altimeter (inches Hg), rounded to nearest 0.1 inHg",
modSLP="Sea-Level Pressure (hPa)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
TempF5="Temperature (F), rounded to nearest 5 degrees",
DewF5="Dew Point (F), rounded to nearest 5 degrees",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
month="Month",
year="Year",
wType="Greatest Sky Obscuration",
day="Day of Month"
)
The function plotCountsByMetric() produces bar plots for counts by variable:
# Helper function for generating plots by key variables
plotcountsByMetric <- function(df,
mets,
title="",
rotateOn=20,
dropNA=TRUE,
diagnose=FALSE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# df: dataframe or tibble containing raw data
# mets: character vector of variables for plotting counts
# title: character vector for plot title
# rotateOn: integer, x-axis labels will be rotated by 90 degrees if # categories >= rotateOn
# dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# mapper: named list containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied df (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function plots overall counts by metric for a given input
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the overall mean will be plotted as a point on the main plot (only makes sense if facetOn has been selected)
# Plot of counts by key metric
for (x in mets) {
# If a facetting variable is provided, need to include this in the group_by
useVars <- x
if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
dat <- df %>%
group_by_at(vars(all_of(useVars))) %>%
summarize(n=n())
if (dropNA) {
nOrig <- nrow(dat)
sumOrig <- sum(dat$n)
dat <- dat %>%
filter_all(all_vars(!is.na(.)))
if (diagnose & (nOrig > nrow(dat))) {
cat("\nDropping",
nOrig-nrow(dat),
"rows with",
sumOrig-sum(dat$n),
"observations due to NA\n"
)
}
}
# Create the main plot
p <- dat %>%
ggplot(aes_string(x=x, y="n")) +
geom_col() +
labs(title=title,
subtitle=paste0("Counts By: ", mapper[x]),
x=paste0(x, " - ", mapper[x]),
y="Count"
)
# If the rotateOn criteria is exceeded, rotate the x-axis by 90 degrees
if (nrow(dat) >= rotateOn) {
p <- p + theme(axis.text.x=element_text(angle=90))
}
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dot plot for the overall average
if (showCentral) {
# Get the median number of observations by facet, or the total if facetOn=NULL
if (is.null(facetOn)) {
useN <- sum(dat$n)
} else {
useN <- dat %>%
group_by_at(vars(all_of(facetOn))) %>%
summarize(n=sum(n)) %>%
pull(n) %>%
median()
}
# Get the overall percentages by x
centralData <- helperCountsByMetric(tbl=dat, ctVar=x, sumOn="n") %>%
mutate(centralValue=nPct*useN)
# Apply the median
p <- p + geom_point(data=centralData, aes(y=centralValue), color="red", size=2)
}
# Print the plot
print(p)
}
}
# Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016,
mets=c("WindDir", "cType1", "month", "wType"),
title="Detroit, MI (2016)"
)
The function plotNumCor() plots two numeric variables against one another:
# Create a function for plotting two variables against each other
plotNumCor <- function(met,
var1,
var2,
title=NULL,
subT="",
dropNA=TRUE,
diagnose=FALSE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# met: dataframe or tibble containing raw data
# var1: character vector of variable to be used for the x-axis
# var2: character vector of variable to be used for the y-axis
# title: character vector for plot title
# subT: character vector for plot subtitle
# dropNA: boolean for whether to drop all NA prior to plotting (recommended for avoiding warnings)
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# mapper: named list containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied met (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function plots overall counts by the provided x/y metrics, with each point sized based on the number of observations, and with an lm smooth overlaid
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the lm smooth that best first to the overall data will be plotted (only makes sense if facetOn has been selected)
# Create the title if not passed
if (is.null(title)) {
title <- paste0("Hourly Observations of ", mapper[var1], " and ", mapper[var2])
}
# If a facetting variable is provided, need to include this in the group_by
useVars <- c(var1, var2)
if (!is.null(facetOn)) { useVars <- c(facetOn, useVars) }
# Pull the counts by useVars
dat <- met %>%
group_by_at(vars(all_of(useVars))) %>%
summarize(n=n())
# If NA requested to be excluded, remove anything with NA
if (dropNA) {
nOrig <- nrow(dat)
sumOrig <- sum(dat$n)
dat <- dat %>%
filter_all(all_vars(!is.na(.)))
if (diagnose) {
cat("\nDropping",
nOrig-nrow(dat),
"rows with",
sumOrig-sum(dat$n),
"observations due to NA\n"
)
}
}
p <- dat %>%
ggplot(aes_string(x=var1, y=var2)) +
geom_point(alpha=0.5, aes_string(size="n")) +
geom_smooth(method="lm", aes_string(weight="n")) +
labs(x=paste0(mapper[var1], " - ", var1),
y=paste0(mapper[var2], " - ", var2),
title=title,
subtitle=subT
)
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dashed line for the overall data
if (showCentral) {
p <- p + helperNumCor(dat, xVar=var1, yVar=var2, sumOn="n")
}
print(p)
}
# Example for Detroit 2016 - using TempC and TempF
plotNumCor(kdtw_2016, var1="TempC", var2="TempF", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Example for Detroit 2016 - using Altimeter and modSLP
plotNumCor(kdtw_2016, var1="Altimeter", var2="modSLP", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
The function plotFactorNumeric() creates box plots for a numeric variable against a factor variable:
# Updated function for plotting numeric by factor
plotFactorNumeric <- function(met,
fctVar,
numVar,
title=NULL,
subT="",
diagnose=TRUE,
showXLabel=TRUE,
mapper=varMapper,
facetOn=NULL,
showCentral=FALSE
) {
# Function arguments
# met: dataframe or tibble containing raw data
# fctVar: character vector of variable to be used for the x-axis (factor in the boxplot)
# numVar: character vector of variable to be used for the y-axis (numeric in the boxplot)
# title: character vector for plot title
# subT: character vector for plot subtitle
# diagnose: boolean for whether to note in the log the number of NA observations dropped
# showXLabel: boolean for whether to include the x-label (e.g., set to FALSE if using 'month')
# mapper: named list containing mapping from variable name to well-formatted name for titles and axes
# facetOn: a facetting variable for the supplied met (NULL for no faceting)
# showCentral: boolean for whether to show the central tendency over-plotted on the main data
# Function usage
# 1. By default, the function creates the boxplot of numVar by fctVar
# 2. If facetOn is passed as a non-NULL, then the data in #1 will be facetted by facetOn
# 3. If showCentral=TRUE, then the overall median of numVar by fctVar will be plotted as a red dot
# Create the title if not passed
if (is.null(title)) {
title <- paste0("Hourly Observations of ", mapper[numVar], " by ", mapper[fctVar])
}
# Remove the NA variables
nOrig <- nrow(met)
dat <- met %>%
filter(!is.na(get(fctVar)), !is.na(get(numVar)))
if (diagnose) { cat("\nRemoving", nOrig-nrow(dat), "records due to NA\n") }
# Create the base plot
p <- dat %>%
ggplot(aes_string(x=fctVar, y=numVar)) +
geom_boxplot(fill="lightblue") +
labs(title=title,
subtitle=subT,
x=ifelse(showXLabel, paste0(mapper[fctVar], " - ", fctVar), ""),
y=paste0(mapper[numVar], " - ", numVar)
)
# If facetting has been requested, facet by the desired variable
if (!is.null(facetOn)) {
p <- p + facet_wrap(as.formula(paste("~", facetOn)))
}
# If showCentral=TRUE, add a dot plot for the overall average
if (showCentral) {
centData <- helperFactorNumeric(dat, .f=median, byVar=fctVar, numVar=numVar)
p <- p + geom_point(data=centData, aes(y=helpFN), size=2, color="red")
}
# Render the final plot
print(p)
}
# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016,
fctVar="month",
numVar="TempF",
subT="Detroit, MI (2016)",
showXLabel=FALSE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Example for Detroit 2016 - using WindSpeed and wType
plotFactorNumeric(kdtw_2016,
fctVar="wType",
numVar="WindSpeed",
subT="Detroit, MI (2016)",
showXLabel=TRUE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Example for Detroit 2016 - using Visibility and wType
plotFactorNumeric(kdtw_2016,
fctVar="wType",
numVar="Visibility",
subT="Detroit, MI (2016)",
showXLabel=TRUE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
An issue previous observed where visibility 1/16SM was interpreted as 16 statutory miles has been corrected in the ‘WeatherDownloads_202005_v002’ file.
The function corMETAR() calculates correlations among numeric variables in the METAR data:
# Function to calculate, display, and plot variable correlations
corMETAR <- function(met, numVars, subT="") {
# Keep only complete cases and report on data kept
dfUse <- met %>%
select_at(vars(all_of(numVars))) %>%
filter(complete.cases(.))
nU <- nrow(dfUse)
nM <- nrow(met)
myPct <- round(100*nU/nM, 1)
cat("\n *** Correlations use ", nU, " complete cases (", myPct, "% of ", nM, " total) ***\n", sep="")
# Create the correlation matrix
mtxCorr <- dfUse %>%
cor()
# Print the correlations
mtxCorr %>%
round(2) %>%
print()
# Display a heat map
corrplot::corrplot(mtxCorr,
method="color",
title=paste0("Hourly Weather Correlations\n", subT),
mar=c(0, 0, 2, 0)
)
}
# Example for Detroit, MI 2016
coreNum <- c("TempC", "TempF", "DewC", "DewF",
"Altimeter", "modSLP", "WindSpeed", "Visibility"
)
corMETAR(kdtw_2016, numVars=coreNum, subT="Detroit, MI (2016) METAR")
##
## *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## TempF 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## DewC 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## DewF 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## Altimeter -0.17 -0.17 -0.22 -0.22 1.00 1.00 -0.37 0.15
## modSLP -0.24 -0.24 -0.28 -0.28 1.00 1.00 -0.35 0.14
## WindSpeed -0.10 -0.10 -0.18 -0.18 -0.37 -0.35 1.00 0.10
## Visibility 0.14 0.14 -0.07 -0.07 0.15 0.14 0.10 1.00
The function lmMETAR() runs simple linear regression models on the METAR data:
# Function for linear regressions on METAR data
lmMETAR <- function(met,
y,
x,
yName,
subT=""
) {
# Convert to formula
myChar <- paste0(y, " ~ ", x)
cat("\n *** Regression call is:", myChar, "***\n")
# Run regression
regr <- lm(formula(myChar), data=met)
# Summarize regression
print(summary(regr))
# Predict the new values
pred <- predict(regr, newdata=met)
# Plot the predictions
p <- met %>%
select_at(vars(all_of(y))) %>%
mutate(pred=pred) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(vars(all_of(c(y, "pred")))) %>%
summarize(n=n()) %>%
ggplot(aes_string(x=y, y="pred")) +
geom_point(aes(size=n), alpha=0.25) +
geom_smooth(aes(weight=n), method="lm") +
labs(title=paste0("Predicted vs. Actual ", yName, " - ", x, " as Predictor"),
subtitle=subT,
x=paste0("Actual ", yName),
y=paste0("Predicted ", yName)
)
print(p)
}
# Examples for Detroit, MI 2016
lmMETAR(kdtw_2016, "modSLP", "Altimeter", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96361 -0.44022 -0.03758 0.40772 1.41448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.44389 0.74723 -22.01 <2e-16 ***
## Altimeter 34.41514 0.02488 1383.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4961 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.913e+06 on 1 and 8767 DF, p-value: < 2.2e-16
lmMETAR(kdtw_2016, "modSLP", "Altimeter + TempF", yName="Sea Level Pressure", subT="Detroit, MI (2016)")
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57802 -0.12674 0.00481 0.12820 0.65708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.713e+00 2.803e-01 -13.25 <2e-16 ***
## Altimeter 3.403e+01 9.302e-03 3658.80 <2e-16 ***
## TempF -2.351e-02 9.941e-05 -236.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1826 on 8766 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.086e+06 on 2 and 8766 DF, p-value: < 2.2e-16
lmMETAR(kdtw_2016, "TempC", "DewC", yName="Temperature (C)", subT="Detroit, MI (2016)")
##
## *** Regression call is: TempC ~ DewC ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.174 -3.172 -1.165 2.822 17.828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.17084 0.05373 114.9 <2e-16 ***
## DewC 0.99909 0.00470 212.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.463 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.8375, Adjusted R-squared: 0.8375
## F-statistic: 4.519e+04 on 1 and 8767 DF, p-value: < 2.2e-16
The basicWindPlots() function creates plots for wind speed and direction:
# Generate basic wind plots
basicWindPlots <- function(met,
dirVar="WindDir",
spdVar="WindSpeed",
desc="",
gran="",
mapper=varMapper
) {
# Plot for the wind direction
wDir <- met %>%
ggplot(aes_string(x=dirVar)) +
geom_bar() +
labs(title=paste0(desc, " Wind Direction"), subtitle=gran,
y="# Hourly Observations", x=mapper[dirVar]
) +
theme(axis.text.x=element_text(angle=90))
print(wDir)
# Plot for the minimum, average, and maximum wind speed by wind direction
# Wind direction 000 is reserved for 0 KT wind, while VRB is reserved for 3-6 KT variable winds
wSpeedByDir <- met %>%
filter(!is.na(get(dirVar))) %>%
group_by_at(vars(all_of(dirVar))) %>%
summarize(minWind=min(get(spdVar)), meanWind=mean(get(spdVar)), maxWind=max(get(spdVar))) %>%
ggplot(aes_string(x=dirVar)) +
geom_point(aes(y=meanWind), color="red", size=2) +
geom_errorbar(aes(ymin=minWind, ymax=maxWind)) +
labs(title=paste0(desc, " Wind Speed (Max, Mean, Min) By Wind Direction"),
subtitle=gran,
y=mapper[spdVar],
x=mapper[dirVar]
) +
theme(axis.text.x=element_text(angle=90))
print(wSpeedByDir)
# Plot for the wind speed
pctZero <- sum(pull(met, spdVar)==0, na.rm=TRUE) / nrow(met)
wSpeed <- met %>%
filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
ggplot(aes_string(x=spdVar)) +
geom_bar(aes(y=..count../sum(..count..))) +
labs(title=paste0(round(100*pctZero), "% of wind speeds in ", desc, " measure 0 Knots"),
subtitle=gran,
y="% Hourly Observations",
x=mapper[spdVar]
)
print(wSpeed)
# Polar plot for wind speed and wind direction
wData <- met %>%
filter_at(vars(all_of(dirVar)), all_vars(!is.na(.) & !(. %in% c("000", "VRB")))) %>%
filter_at(vars(all_of(spdVar)), all_vars(!is.na(.))) %>%
mutate_at(vars(all_of(dirVar)), as.numeric) %>%
group_by_at(vars(all_of(c(dirVar, spdVar)))) %>%
summarize(n=n())
wPolarDirSpeed <- wData %>%
ggplot(aes_string(x=spdVar, y=dirVar)) +
geom_point(alpha=0.1, aes(size=n)) +
coord_polar(theta="y") +
labs(title=paste0(desc, " Direction vs. Wind Speed"),
subtitle=gran,
x=mapper[spdVar],
y=mapper[dirVar]
) +
scale_y_continuous(limits=c(0, 360), breaks=c(0, 90, 180, 270, 360)) +
scale_x_continuous(limits=c(0, 40), breaks=c(0, 5, 10, 15, 20, 25, 30, 35, 40)) +
geom_point(aes(x=0, y=0), color="red", size=2)
print(wPolarDirSpeed)
}
# Example for Detroit, MI 2016
basicWindPlots(kdtw_2016, desc="Detroit, MI (2016)", gran="KDTW METAR")
The getWindDirGroup() function maps wind direction to a category such as NNE. Because the METAR data are recorded in units of 10 degrees, either 4 groupings (90 degrees each) or 12 groupings (30 degrees each) are preferred, so that each category has the same underlying number of buckets:
# Extract the wind direction data from a processed METAR file
getWindDirGroup <- function(met, src) {
# Use the fullMETAR data and extract WindDir, WindSpeed, month
windPlotData <- met %>%
select(WindDir, WindSpeed, month) %>%
mutate(windDirGroup=factor(case_when(WindSpeed==0 ~ "No Wind",
WindDir=="VRB" ~ "Variable",
WindDir %in% c("350", "360", "010") ~ "N",
WindDir %in% c("020", "030", "040") ~ "NNE",
WindDir %in% c("050", "060", "070") ~ "ENE",
WindDir %in% c("080", "090", "100") ~ "E",
WindDir %in% c("110", "120", "130") ~ "ESE",
WindDir %in% c("140", "150", "160") ~ "SSE",
WindDir %in% c("170", "180", "190") ~ "S",
WindDir %in% c("200", "210", "220") ~ "SSW",
WindDir %in% c("230", "240", "250") ~ "WSW",
WindDir %in% c("260", "270", "280") ~ "W",
WindDir %in% c("290", "300", "310") ~ "WNW",
WindDir %in% c("320", "330", "340") ~ "NNW",
TRUE ~ "Error"
),
levels=c("No Wind", "Variable", "Error",
"N", "NNE", "ENE",
"E", "ESE", "SSE",
"S", "SSW", "WSW",
"W", "WNW", "NNW"
)
)
)
# Rempve the errors and calculate percentages by month for the remainder
processedWindData <- windPlotData %>%
filter(windDirGroup != "Error") %>%
group_by(month, windDirGroup) %>%
summarize(n=n()) %>%
ungroup() %>%
group_by(month) %>%
mutate(pct=n/sum(n)) %>%
ungroup() %>%
mutate(src=src)
processedWindData
}
The function conslidatePlotWind() then calls getWindDirGroup() for any number of files:
# Consolidate and plot wind data
consolidatePlotWind <- function(files, names) {
consFun <- function(x, y) { getWindDirGroup(met=x, src=y) }
boundByRows <- map2_dfr(.x=files, .y=names, .f=consFun)
# Show frequency by month for each city, faceted by wind direction
p1 <- boundByRows %>%
ggplot(aes(x=month, y=pct, color=src)) +
geom_line(aes(group=src)) +
facet_wrap(~windDirGroup) +
labs(title="Wind Frequency by Month",
x="Month",
y="Frequency of Wind Observations"
) +
theme(axis.text.x=element_text(angle=90))
print(p1)
# Show frequency by wind direction for each city, faceted by month
p2 <- boundByRows %>%
ggplot(aes(x=windDirGroup, y=pct, color=src)) +
geom_line(aes(group=src)) +
facet_wrap(~month) +
labs(title="Wind Frequency by Wind Direction",
x="Wind Direction",
y="Frequency of Wind Observations"
) +
theme(axis.text.x=element_text(angle=90))
print(p2)
boundByRows
}
# Load the Las Vegas data and New Orleans data for comparison
kmsy_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_kmsy_2016.rds")
klas_2016 <- readRDS("./RInputFiles/ProcessedMETAR/metar_klas_2016.rds")
# Run wind by month comparisons for Detroit, Las Vegas, New Orleans
consolidatePlotWind(files=list(kdtw_2016, klas_2016, kmsy_2016),
names=c("Detroit, MI (2016)", "Las Vegas, NV (2016)", "New Orleans, LA (2016)")
)
## # A tibble: 504 x 5
## month windDirGroup n pct src
## <fct> <fct> <int> <dbl> <chr>
## 1 Jan No Wind 46 0.0600 Detroit, MI (2016)
## 2 Jan Variable 3 0.00391 Detroit, MI (2016)
## 3 Jan N 26 0.0339 Detroit, MI (2016)
## 4 Jan NNE 26 0.0339 Detroit, MI (2016)
## 5 Jan ENE 9 0.0117 Detroit, MI (2016)
## 6 Jan E 16 0.0209 Detroit, MI (2016)
## 7 Jan ESE 14 0.0183 Detroit, MI (2016)
## 8 Jan SSE 72 0.0939 Detroit, MI (2016)
## 9 Jan S 131 0.171 Detroit, MI (2016)
## 10 Jan SSW 158 0.206 Detroit, MI (2016)
## # ... with 494 more rows
The functions can be combined so that a full process can be run for a given file:
# File name to city name mapper
cityNameMapper <- c(kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klnk_2016="Lincoln, NE (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ktvc_2016="Traverse City, MI (2016)"
)
# This is a helper function to create a locale description
getLocaleDescription <- function(x, mapper=cityNameMapper) {
# Initialize the description as NULL
desc <- NULL
for (potMatch in names(mapper)) {
if (str_detect(string=x, pattern=potMatch)) {
desc <- mapper[potMatch]
break
}
}
# If the mapping failed, use UNMAPPED_x as the description
if (is.null(desc)) {
desc <- paste0("UNMAPPED_", x)
cat("\nUnable to find a description, will use ", desc, "\n\n", sep="")
} else {
cat("\nWill use ", desc, " as the description for ", x, "\n\n", sep="")
}
# Return the descriptive name
desc
}
# The following function runs the functions that work on a single data source
combinedEDA <- function(filename=NULL,
tbl=NULL,
desc=NULL,
mets=c("WindDir", "WindSpeed", "TempC", "DewC", "Altimeter",
"modSLP", "cType1", "cLevel1", "month", "day"
),
corPairs=list(c("TempC", "TempF"),
c("TempC", "DewC"),
c("Altimeter", "modSLP"),
c("Altimeter", "WindSpeed")
),
fctPairs=list(c("month", "TempF"),
c("month", "DewF"),
c("month", "WindSpeed"),
c("month", "Altimeter"),
c("wType", "Visibility"),
c("wType", "WindSpeed"),
c("WindDir", "WindSpeed"),
c("WindDir", "TempF")
),
heatVars=c("TempC", "TempF",
"DewC", "DewF",
"Altimeter", "modSLP",
"WindSpeed", "Visibility",
"monthint", "day"
),
lmVars=list(c("modSLP", "Altimeter"),
c("modSLP", "Altimeter + TempF"),
c("TempF", "DewF"),
c("WindSpeed", "Altimeter + TempF + DewF + month")
),
mapVariables=varMapper,
mapFileNames=cityNameMapper,
path="./RInputFiles/ProcessedMETAR/"
) {
# Require that either filename OR tbl be passed
if (is.null(filename) & is.null(tbl)) {
cat("\nMust provide either a filename or an already-loaded tibble\n")
stop("\nfilename=NULL and tbl=NULL may not both be passed to combinedEDA()\n")
}
# Require that either 1) filename and mapFileNames, OR 2) desc be passed
if ((is.null(filename) | is.null(mapFileNames)) & is.null(desc)) {
cat("\nMust provide either a filename with mapFileNames or a file description\n")
stop("\nWhen desc=NULL must have non-null entries for both filename= and mapFileNames=\n")
}
# Find the description if it is NULL (default)
if (is.null(desc)) {
desc <- getLocaleDescription(filename, mapper=mapFileNames)
}
# Warn if both filename and tbl are passed, since tbl will be used
if (!is.null(filename) & !is.null(tbl)) {
cat("\nA tibble has been passed and will be used as the dataset for this function\n")
warning("\nArgument filename=", filename, " is NOT loaded since a tibble was passed\n")
}
# Read in the file unless tbl has already been passed to the routine
if (is.null(tbl)) {
tbl <- readRDS(paste0(path, filename))
}
# Plot counts by metric
plotcountsByMetric(tbl, mets=mets, title=desc, diagnose=TRUE)
# Plot relationships between two variables
for (ys in corPairs) {
plotNumCor(tbl, var1=ys[1], var2=ys[2], subT=desc, diagnose=TRUE)
}
# plot numeric vs. factor
for (ys in fctPairs) {
plotFactorNumeric(tbl, fctVar=ys[1], numVar=ys[2], subT=desc, showXLabel=FALSE, diagnose=TRUE)
}
# Heatmap for variable correlations
corMETAR(tbl, numVars=heatVars, subT=paste0(desc, " METAR"))
# Run linear rergression
for (ys in lmVars) {
lmMETAR(tbl, y=ys[1], x=ys[2], yName=varMapper[ys[1]], subT=desc)
}
# Run the basic wind plots
basicWindPlots(tbl, desc=desc, gran="")
# Return the tibble
tbl
}
The process can then be run for each of Detroit (2016), Las Vegas (2016), and New Orleans (2016):
# Retrieve the Detroit, MI (2016) data
kdtw_2016 <- combinedEDA("metar_kdtw_2016.rds")
##
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
##
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 651 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Dropping 1 rows with 49 observations due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## Removing 49 records due to NA
##
## *** Correlations use 8769 complete cases (99.4% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## TempF 1.00 1.00 0.92 0.92 -0.17 -0.24 -0.10 0.14
## DewC 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## DewF 0.92 0.92 1.00 1.00 -0.22 -0.28 -0.18 -0.07
## Altimeter -0.17 -0.17 -0.22 -0.22 1.00 1.00 -0.37 0.15
## modSLP -0.24 -0.24 -0.28 -0.28 1.00 1.00 -0.35 0.14
## WindSpeed -0.10 -0.10 -0.18 -0.18 -0.37 -0.35 1.00 0.10
## Visibility 0.14 0.14 -0.07 -0.07 0.15 0.14 0.10 1.00
## monthint 0.24 0.24 0.32 0.32 0.19 0.17 -0.06 -0.11
## day 0.04 0.04 0.03 0.03 -0.02 -0.02 0.06 0.04
## monthint day
## TempC 0.24 0.04
## TempF 0.24 0.04
## DewC 0.32 0.03
## DewF 0.32 0.03
## Altimeter 0.19 -0.02
## modSLP 0.17 -0.02
## WindSpeed -0.06 0.06
## Visibility -0.11 0.04
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96361 -0.44022 -0.03758 0.40772 1.41448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.44389 0.74723 -22.01 <2e-16 ***
## Altimeter 34.41514 0.02488 1383.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4961 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9954
## F-statistic: 1.913e+06 on 1 and 8767 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57802 -0.12674 0.00481 0.12820 0.65708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.713e+00 2.803e-01 -13.25 <2e-16 ***
## Altimeter 3.403e+01 9.302e-03 3658.80 <2e-16 ***
## TempF -2.351e-02 9.941e-05 -236.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1826 on 8766 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.086e+06 on 2 and 8766 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.132 -6.065 -2.083 4.042 31.029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.060663 0.211976 52.18 <2e-16 ***
## DewF 1.000977 0.004677 214.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.987 on 8767 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8393
## F-statistic: 4.58e+04 on 1 and 8767 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.3544 -2.5454 -0.1494 2.3669 20.6042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306.827720 6.726512 45.615 < 2e-16 ***
## Altimeter -9.980907 0.222840 -44.790 < 2e-16 ***
## TempF 0.206384 0.006044 34.145 < 2e-16 ***
## DewF -0.217533 0.006755 -32.204 < 2e-16 ***
## monthFeb 0.188758 0.204057 0.925 0.3550
## monthMar -1.375926 0.211244 -6.513 7.75e-11 ***
## monthApr -2.406187 0.217870 -11.044 < 2e-16 ***
## monthMay -3.616016 0.248432 -14.555 < 2e-16 ***
## monthJun -4.337974 0.277532 -15.631 < 2e-16 ***
## monthJul -3.428235 0.300872 -11.394 < 2e-16 ***
## monthAug -2.971582 0.312087 -9.522 < 2e-16 ***
## monthSep -1.839206 0.292614 -6.285 3.43e-10 ***
## monthOct -0.477868 0.255640 -1.869 0.0616 .
## monthNov -0.013008 0.226830 -0.057 0.9543
## monthDec 2.071836 0.200848 10.315 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.883 on 8754 degrees of freedom
## (49 observations deleted due to missingness)
## Multiple R-squared: 0.3218, Adjusted R-squared: 0.3207
## F-statistic: 296.7 on 14 and 8754 DF, p-value: < 2.2e-16
# Retrieve the Las Vegas, NV (2016) data
klas_2016 <- combinedEDA("metar_klas_2016.rds")
##
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
##
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 2440 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Dropping 1 rows with 35 observations due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## Removing 35 records due to NA
##
## *** Correlations use 8783 complete cases (99.6% of 8818 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.22 0.22 -0.51 -0.62 0.22 0.01
## TempF 1.00 1.00 0.22 0.22 -0.51 -0.62 0.22 0.01
## DewC 0.22 0.22 1.00 1.00 -0.24 -0.27 -0.04 -0.13
## DewF 0.22 0.22 1.00 1.00 -0.24 -0.27 -0.04 -0.13
## Altimeter -0.51 -0.51 -0.24 -0.24 1.00 0.99 -0.38 0.06
## modSLP -0.62 -0.62 -0.27 -0.27 0.99 1.00 -0.38 0.06
## WindSpeed 0.22 0.22 -0.04 -0.04 -0.38 -0.38 1.00 -0.02
## Visibility 0.01 0.01 -0.13 -0.13 0.06 0.06 -0.02 1.00
## monthint 0.14 0.14 0.06 0.06 0.06 0.03 -0.01 -0.02
## day -0.01 -0.01 0.03 0.03 -0.01 -0.01 0.00 -0.07
## monthint day
## TempC 0.14 -0.01
## TempF 0.14 -0.01
## DewC 0.06 0.03
## DewF 0.06 0.03
## Altimeter 0.06 -0.01
## modSLP 0.03 -0.01
## WindSpeed -0.01 0.00
## Visibility -0.02 -0.07
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1381 -0.7848 -0.0607 0.6891 3.6362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -116.30984 1.72975 -67.24 <2e-16 ***
## Altimeter 37.68826 0.05776 652.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9753 on 8781 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.9798, Adjusted R-squared: 0.9798
## F-statistic: 4.258e+05 on 1 and 8781 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18161 -0.33113 0.02395 0.33395 1.08509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.537e+01 8.693e-01 -29.18 <2e-16 ***
## Altimeter 3.478e+01 2.868e-02 1212.89 <2e-16 ***
## TempF -5.581e-02 2.813e-04 -198.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4165 on 8780 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.9963, Adjusted R-squared: 0.9963
## F-statistic: 1.187e+06 on 2 and 8780 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.026 -14.381 -0.341 13.299 46.312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.12523 0.49251 126.14 <2e-16 ***
## DewF 0.32810 0.01576 20.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.94 on 8781 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.04705, Adjusted R-squared: 0.04694
## F-statistic: 433.5 on 1 and 8781 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1199 -2.7665 -0.6134 2.0975 22.2630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.166903 9.583584 32.990 < 2e-16 ***
## Altimeter -10.363527 0.316290 -32.766 < 2e-16 ***
## TempF 0.012196 0.005239 2.328 0.019933 *
## DewF -0.053040 0.004107 -12.915 < 2e-16 ***
## monthFeb 1.477528 0.225729 6.546 6.26e-11 ***
## monthMar 0.848312 0.232319 3.652 0.000262 ***
## monthApr 1.590035 0.242532 6.556 5.84e-11 ***
## monthMay 0.327608 0.261841 1.251 0.210905
## monthJun 0.141204 0.319290 0.442 0.658325
## monthJul 1.092983 0.328362 3.329 0.000876 ***
## monthAug 0.545515 0.319165 1.709 0.087450 .
## monthSep 0.647423 0.281377 2.301 0.021420 *
## monthOct 1.147640 0.253286 4.531 5.95e-06 ***
## monthNov 1.097005 0.227198 4.828 1.40e-06 ***
## monthDec 0.672432 0.214810 3.130 0.001752 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.181 on 8768 degrees of freedom
## (35 observations deleted due to missingness)
## Multiple R-squared: 0.1774, Adjusted R-squared: 0.176
## F-statistic: 135 on 14 and 8768 DF, p-value: < 2.2e-16
# Retrieve the New Orleans, LA (2016) data
kmsy_2016 <- combinedEDA("metar_kmsy_2016.rds")
##
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
##
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 1029 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Dropping 1 rows with 14 observations due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## Removing 14 records due to NA
##
## *** Correlations use 8799 complete cases (99.8% of 8813 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.85 0.85 -0.48 -0.47 -0.09 0.13
## TempF 1.00 1.00 0.85 0.85 -0.48 -0.48 -0.09 0.13
## DewC 0.85 0.85 1.00 1.00 -0.56 -0.56 -0.21 -0.06
## DewF 0.85 0.85 1.00 1.00 -0.57 -0.56 -0.21 -0.05
## Altimeter -0.48 -0.48 -0.56 -0.57 1.00 1.00 -0.08 0.08
## modSLP -0.47 -0.48 -0.56 -0.56 1.00 1.00 -0.08 0.08
## WindSpeed -0.09 -0.09 -0.21 -0.21 -0.08 -0.08 1.00 0.03
## Visibility 0.13 0.13 -0.06 -0.05 0.08 0.08 0.03 1.00
## monthint 0.26 0.26 0.31 0.31 0.09 0.09 -0.22 -0.02
## day 0.03 0.03 0.07 0.07 0.02 0.02 -0.05 0.01
## monthint day
## TempC 0.26 0.03
## TempF 0.26 0.03
## DewC 0.31 0.07
## DewF 0.31 0.07
## Altimeter 0.09 0.02
## modSLP 0.09 0.02
## WindSpeed -0.22 -0.05
## Visibility -0.02 0.01
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.232515 -0.083700 0.000544 0.084787 0.227320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.773456 0.230067 12.05 <2e-16 ***
## Altimeter 33.779607 0.007654 4413.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1015 on 8797 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 1.948e+07 on 1 and 8797 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.222066 -0.082359 -0.001036 0.083533 0.217460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.557e+00 2.638e-01 5.903 3.7e-09 ***
## Altimeter 3.382e+01 8.667e-03 3902.029 < 2e-16 ***
## TempF 8.751e-04 9.432e-05 9.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.101 on 8796 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.9996, Adjusted R-squared: 0.9996
## F-statistic: 9.833e+06 on 2 and 8796 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.051 -4.777 -1.265 4.199 26.477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.440028 0.319858 76.41 <2e-16 ***
## DewF 0.778877 0.005065 153.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.763 on 8797 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.7288, Adjusted R-squared: 0.7288
## F-statistic: 2.364e+04 on 1 and 8797 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9452 -2.5710 -0.2247 2.2592 18.9024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.986029 11.812622 16.168 < 2e-16 ***
## Altimeter -6.223347 0.387357 -16.066 < 2e-16 ***
## TempF 0.225378 0.007290 30.915 < 2e-16 ***
## DewF -0.169883 0.006513 -26.082 < 2e-16 ***
## monthFeb -0.787460 0.204817 -3.845 0.000122 ***
## monthMar -0.270554 0.215758 -1.254 0.209885
## monthApr -1.941174 0.225642 -8.603 < 2e-16 ***
## monthMay -3.763407 0.241926 -15.556 < 2e-16 ***
## monthJun -5.215285 0.269852 -19.326 < 2e-16 ***
## monthJul -5.568010 0.283515 -19.639 < 2e-16 ***
## monthAug -4.834769 0.277122 -17.446 < 2e-16 ***
## monthSep -5.769099 0.272789 -21.149 < 2e-16 ***
## monthOct -4.812814 0.245841 -19.577 < 2e-16 ***
## monthNov -2.827680 0.220138 -12.845 < 2e-16 ***
## monthDec -0.203742 0.210552 -0.968 0.333241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.847 on 8784 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.2278, Adjusted R-squared: 0.2266
## F-statistic: 185.1 on 14 and 8784 DF, p-value: < 2.2e-16
There is a lot of good information in the EDA, but it can be overwhelming to have everything in one place. Perhaps a wrapper function can be built allowing for outputs to be piped to a given file:
wrapCombinedEDA <- function(readFile,
readPath="./RInputFiles/ProcessedMETAR/",
mapFileNames=cityNameMapper,
desc=NULL,
writeLogFile=NULL,
writeLogPDF=NULL,
writeLogPath=NULL,
appendWriteFile=FALSE,
...
) {
# Read in the requested file
tbl <- readRDS(paste0(readPath, readFile))
# Find the description if it has not been passed
if (is.null(desc)) {
desc <- getLocaleDescription(readFile, mapper=mapFileNames)
}
# Helper function that only runs the combinedEDA() routine
coreFunc <- function() { combinedEDA(tbl=tbl, desc=desc, mapFileNames=mapFileNames, ...) }
# If writeLogPDF is not NULL, direct the graphs to a suitable PDF
if (!is.null(writeLogPDF)) {
# Prepend the provided log path if it has not been made available
if (!is.null(writeLogPath)) {
writeLogPDF <- paste0(writeLogPath, writeLogPDF)
}
# Provide the location of the EDA pdf file
cat("\nEDA PDF file is available at:", writeLogPDF, "\n")
# Redirect the writing to writeLogPDF
pdf(writeLogPDF)
}
# Run EDA on the tbl using capture.output to redirect to a log file if specified
if (!is.null(writeLogFile)) {
# Prepend the provided log path if it has not been made available
if (!is.null(writeLogPath)) {
writeLogFile <- paste0(writeLogPath, writeLogFile)
}
# Provide the location of the EDA log file
cat("\nEDA log file is available at:", writeLogFile, "\n")
# Run EDA such that the output goes to the log file
capture.output(coreFunc(),
file=writeLogFile,
append=appendWriteFile
)
} else {
# Run EDA such that output stays in stdout
coreFunc()
}
# If writeLogPDF is not NULL, redirect to stdout
if (!is.null(writeLogPDF)) {
dev.off()
}
# Return the tbl
tbl
}
filePath <- "./RInputFiles/ProcessedMETAR/"
# Example for the basic function for Chicago, IL (2016) written to stdout
kord_2016 <- wrapCombinedEDA("metar_kord_2016.rds", readPath=filePath)
##
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
##
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 823 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Dropping 1 rows with 10 observations due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## Removing 10 records due to NA
##
## *** Correlations use 8805 complete cases (99.9% of 8815 total) ***
## TempC TempF DewC DewF Altimeter modSLP WindSpeed Visibility
## TempC 1.00 1.00 0.93 0.93 -0.22 -0.29 -0.12 0.19
## TempF 1.00 1.00 0.93 0.93 -0.22 -0.29 -0.12 0.19
## DewC 0.93 0.93 1.00 1.00 -0.27 -0.34 -0.19 0.05
## DewF 0.93 0.93 1.00 1.00 -0.28 -0.34 -0.19 0.05
## Altimeter -0.22 -0.22 -0.27 -0.28 1.00 1.00 -0.31 0.19
## modSLP -0.29 -0.29 -0.34 -0.34 1.00 1.00 -0.29 0.17
## WindSpeed -0.12 -0.12 -0.19 -0.19 -0.31 -0.29 1.00 0.01
## Visibility 0.19 0.19 0.05 0.05 0.19 0.17 0.01 1.00
## monthint 0.24 0.24 0.27 0.27 0.18 0.16 -0.09 0.02
## day 0.05 0.05 0.05 0.05 -0.06 -0.06 0.08 0.04
## monthint day
## TempC 0.24 0.05
## TempF 0.24 0.05
## DewC 0.27 0.05
## DewF 0.27 0.05
## Altimeter 0.18 -0.06
## modSLP 0.16 -0.06
## WindSpeed -0.09 0.08
## Visibility 0.02 0.04
## monthint 1.00 0.02
## day 0.02 1.00
##
## *** Regression call is: modSLP ~ Altimeter ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31041 -0.47561 -0.07852 0.43888 1.66927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.11596 0.85978 -26.89 <2e-16 ***
## Altimeter 34.63779 0.02864 1209.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5573 on 8803 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.994, Adjusted R-squared: 0.994
## F-statistic: 1.463e+06 on 1 and 8803 DF, p-value: < 2.2e-16
##
## *** Regression call is: modSLP ~ Altimeter + TempF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69460 -0.12269 0.00055 0.12142 0.73628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.603e+00 2.864e-01 -16.07 <2e-16 ***
## Altimeter 3.407e+01 9.502e-03 3585.30 <2e-16 ***
## TempF -2.606e-02 9.503e-05 -274.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1804 on 8802 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 7.018e+06 on 2 and 8802 DF, p-value: < 2.2e-16
##
## *** Regression call is: TempF ~ DewF ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.638 -5.485 -1.570 3.601 35.456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.317877 0.188557 54.72 <2e-16 ***
## DewF 1.004504 0.004095 245.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.407 on 8803 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.8724, Adjusted R-squared: 0.8724
## F-statistic: 6.019e+04 on 1 and 8803 DF, p-value: < 2.2e-16
##
## *** Regression call is: WindSpeed ~ Altimeter + TempF + DewF + month ***
##
## Call:
## lm(formula = formula(myChar), data = met)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.295 -2.635 -0.157 2.494 21.356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.341063 7.156664 40.709 < 2e-16 ***
## Altimeter -9.415029 0.237118 -39.706 < 2e-16 ***
## TempF 0.160498 0.006445 24.904 < 2e-16 ***
## DewF -0.205407 0.006912 -29.718 < 2e-16 ***
## monthFeb 0.649436 0.210063 3.092 0.001997 **
## monthMar 0.033603 0.220448 0.152 0.878850
## monthApr 0.092196 0.230608 0.400 0.689317
## monthMay -0.859361 0.258300 -3.327 0.000882 ***
## monthJun -1.564544 0.296635 -5.274 1.36e-07 ***
## monthJul -1.191767 0.314437 -3.790 0.000152 ***
## monthAug -1.319054 0.319647 -4.127 3.72e-05 ***
## monthSep 0.394263 0.299972 1.314 0.188768
## monthOct 0.511074 0.264003 1.936 0.052916 .
## monthNov 0.050679 0.235873 0.215 0.829882
## monthDec 1.741405 0.205243 8.485 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.984 on 8790 degrees of freedom
## (10 observations deleted due to missingness)
## Multiple R-squared: 0.2494, Adjusted R-squared: 0.2482
## F-statistic: 208.6 on 14 and 8790 DF, p-value: < 2.2e-16
# Example for the basic function for San Diego, CA (2016) written to 'metar_ksan_2016_EDA.log'
ksan_2016 <- wrapCombinedEDA("metar_ksan_2016.rds",
readPath=filePath,
writeLogFile='metar_ksan_2016_EDA.log',
writeLogPDF='metar_ksan_2016_EDA.pdf',
writeLogPath=filePath
)
##
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
Finally, a function can be called to create the inputs to wrapCombinedEDA() for lower typing:
logAndPDFCombinedEDA <- function(tblName, filePath="./RInputFiles/ProcessedMETAR/") {
# Create the RDS file name
rdsName <- paste0("metar_", tblName, ".rds")
cat("\nRDS Name:", rdsName)
# Create the log file name
logName <- paste0("metar_", tblName, "_EDA.log")
cat("\nLog Name:", logName)
# Create the PDF file name
pdfName <- paste0("metar_", tblName, "_EDA.pdf")
cat("\nPDF Name:", pdfName)
# Call wrapCombinedEDA()
tbl <- wrapCombinedEDA(rdsName,
readPath=filePath,
writeLogFile=logName,
writeLogPDF=pdfName,
writeLogPath=filePath
)
# Return the tbl
tbl
}
This function can then be run for all of the relevant files (cached to avoid multiple runs):
# Run for 2016 only for kdtw, kewr, kgrb, kgrr, kiah, kind, klnk, kmke, kmsn, kmsp, ktvc
kdtw_2016 <- logAndPDFCombinedEDA("kdtw_2016")
##
## RDS Name: metar_kdtw_2016.rds
## Log Name: metar_kdtw_2016_EDA.log
## PDF Name: metar_kdtw_2016_EDA.pdf
## Will use Detroit, MI (2016) as the description for metar_kdtw_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kdtw_2016_EDA.log
kewr_2016 <- logAndPDFCombinedEDA("kewr_2016")
##
## RDS Name: metar_kewr_2016.rds
## Log Name: metar_kewr_2016_EDA.log
## PDF Name: metar_kewr_2016_EDA.pdf
## Will use Newark, NJ (2016) as the description for metar_kewr_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kewr_2016_EDA.log
kgrb_2016 <- logAndPDFCombinedEDA("kgrb_2016")
##
## RDS Name: metar_kgrb_2016.rds
## Log Name: metar_kgrb_2016_EDA.log
## PDF Name: metar_kgrb_2016_EDA.pdf
## Will use Green Bay, WI (2016) as the description for metar_kgrb_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrb_2016_EDA.log
kgrr_2016 <- logAndPDFCombinedEDA("kgrr_2016")
##
## RDS Name: metar_kgrr_2016.rds
## Log Name: metar_kgrr_2016_EDA.log
## PDF Name: metar_kgrr_2016_EDA.pdf
## Will use Grand Rapids, MI (2016) as the description for metar_kgrr_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kgrr_2016_EDA.log
kiah_2016 <- logAndPDFCombinedEDA("kiah_2016")
##
## RDS Name: metar_kiah_2016.rds
## Log Name: metar_kiah_2016_EDA.log
## PDF Name: metar_kiah_2016_EDA.pdf
## Will use Houston, TX (2016) as the description for metar_kiah_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kiah_2016_EDA.log
kind_2016 <- logAndPDFCombinedEDA("kind_2016")
##
## RDS Name: metar_kind_2016.rds
## Log Name: metar_kind_2016_EDA.log
## PDF Name: metar_kind_2016_EDA.pdf
## Will use Indianapolis, IN (2016) as the description for metar_kind_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kind_2016_EDA.log
klnk_2016 <- logAndPDFCombinedEDA("klnk_2016")
##
## RDS Name: metar_klnk_2016.rds
## Log Name: metar_klnk_2016_EDA.log
## PDF Name: metar_klnk_2016_EDA.pdf
## Will use Lincoln, NE (2016) as the description for metar_klnk_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klnk_2016_EDA.log
kmke_2016 <- logAndPDFCombinedEDA("kmke_2016")
##
## RDS Name: metar_kmke_2016.rds
## Log Name: metar_kmke_2016_EDA.log
## PDF Name: metar_kmke_2016_EDA.pdf
## Will use Milwaukee, WI (2016) as the description for metar_kmke_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmke_2016_EDA.log
kmsn_2016 <- logAndPDFCombinedEDA("kmsn_2016")
##
## RDS Name: metar_kmsn_2016.rds
## Log Name: metar_kmsn_2016_EDA.log
## PDF Name: metar_kmsn_2016_EDA.pdf
## Will use Madison, WI (2016) as the description for metar_kmsn_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsn_2016_EDA.log
kmsp_2016 <- logAndPDFCombinedEDA("kmsp_2016")
##
## RDS Name: metar_kmsp_2016.rds
## Log Name: metar_kmsp_2016_EDA.log
## PDF Name: metar_kmsp_2016_EDA.pdf
## Will use Minneapolis, MN (2016) as the description for metar_kmsp_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsp_2016_EDA.log
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
##
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
# Run for 2015-2016-2017 for klas, kmsy, kord, ksan
klas_2015 <- logAndPDFCombinedEDA("klas_2015")
##
## RDS Name: metar_klas_2015.rds
## Log Name: metar_klas_2015_EDA.log
## PDF Name: metar_klas_2015_EDA.pdf
## Will use Las Vegas, NV (2015) as the description for metar_klas_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2015_EDA.log
klas_2016 <- logAndPDFCombinedEDA("klas_2016")
##
## RDS Name: metar_klas_2016.rds
## Log Name: metar_klas_2016_EDA.log
## PDF Name: metar_klas_2016_EDA.pdf
## Will use Las Vegas, NV (2016) as the description for metar_klas_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2016_EDA.log
klas_2017 <- logAndPDFCombinedEDA("klas_2017")
##
## RDS Name: metar_klas_2017.rds
## Log Name: metar_klas_2017_EDA.log
## PDF Name: metar_klas_2017_EDA.pdf
## Will use Las Vegas, NV (2017) as the description for metar_klas_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_klas_2017_EDA.log
kmsy_2015 <- logAndPDFCombinedEDA("kmsy_2015")
##
## RDS Name: metar_kmsy_2015.rds
## Log Name: metar_kmsy_2015_EDA.log
## PDF Name: metar_kmsy_2015_EDA.pdf
## Will use New Orleans, LA (2015) as the description for metar_kmsy_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2015_EDA.log
kmsy_2016 <- logAndPDFCombinedEDA("kmsy_2016")
##
## RDS Name: metar_kmsy_2016.rds
## Log Name: metar_kmsy_2016_EDA.log
## PDF Name: metar_kmsy_2016_EDA.pdf
## Will use New Orleans, LA (2016) as the description for metar_kmsy_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2016_EDA.log
kmsy_2017 <- logAndPDFCombinedEDA("kmsy_2017")
##
## RDS Name: metar_kmsy_2017.rds
## Log Name: metar_kmsy_2017_EDA.log
## PDF Name: metar_kmsy_2017_EDA.pdf
## Will use New Orleans, LA (2017) as the description for metar_kmsy_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kmsy_2017_EDA.log
kord_2015 <- logAndPDFCombinedEDA("kord_2015")
##
## RDS Name: metar_kord_2015.rds
## Log Name: metar_kord_2015_EDA.log
## PDF Name: metar_kord_2015_EDA.pdf
## Will use Chicago, IL (2015) as the description for metar_kord_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2015_EDA.log
kord_2016 <- logAndPDFCombinedEDA("kord_2016")
##
## RDS Name: metar_kord_2016.rds
## Log Name: metar_kord_2016_EDA.log
## PDF Name: metar_kord_2016_EDA.pdf
## Will use Chicago, IL (2016) as the description for metar_kord_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2016_EDA.log
kord_2017 <- logAndPDFCombinedEDA("kord_2017")
##
## RDS Name: metar_kord_2017.rds
## Log Name: metar_kord_2017_EDA.log
## PDF Name: metar_kord_2017_EDA.pdf
## Will use Chicago, IL (2017) as the description for metar_kord_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_kord_2017_EDA.log
ksan_2015 <- logAndPDFCombinedEDA("ksan_2015")
##
## RDS Name: metar_ksan_2015.rds
## Log Name: metar_ksan_2015_EDA.log
## PDF Name: metar_ksan_2015_EDA.pdf
## Will use San Diego, CA (2015) as the description for metar_ksan_2015.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2015_EDA.log
ksan_2016 <- logAndPDFCombinedEDA("ksan_2016")
##
## RDS Name: metar_ksan_2016.rds
## Log Name: metar_ksan_2016_EDA.log
## PDF Name: metar_ksan_2016_EDA.pdf
## Will use San Diego, CA (2016) as the description for metar_ksan_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2016_EDA.log
ksan_2017 <- logAndPDFCombinedEDA("ksan_2017")
##
## RDS Name: metar_ksan_2017.rds
## Log Name: metar_ksan_2017_EDA.log
## PDF Name: metar_ksan_2017_EDA.pdf
## Will use San Diego, CA (2017) as the description for metar_ksan_2017.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ksan_2017_EDA.log
Reload Traverse City, MI (2016) due to a previous error (short-term fix to avoid re-running a full cache):
ktvc_2016 <- logAndPDFCombinedEDA("ktvc_2016")
##
## RDS Name: metar_ktvc_2016.rds
## Log Name: metar_ktvc_2016_EDA.log
## PDF Name: metar_ktvc_2016_EDA.pdf
## Will use Traverse City, MI (2016) as the description for metar_ktvc_2016.rds
##
##
## EDA PDF file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.pdf
##
## EDA log file is available at: ./RInputFiles/ProcessedMETAR/metar_ktvc_2016_EDA.log
The files are now available for further analysis, with individual PDF files for plots associated with each locale.
With EDA about each locale saved to .pdf and .log files, it is interesting to investigate comparisons among the various locales. For example, how do the temperature patterns by month for a given locale compare to the overall global median temperature patterns by month?
The existing functions contain most of the code needed to perform this. The main steps to add are:
The first step is to combine one or more processed files, with a column added for locale:
# Combine files from a character list
combineProcessedFiles <- function(charList, mapper=cityNameMapper) {
# Combine the objects represented by charList, and name the list items using charList
listFiles <- lapply(charList, FUN=function(x) { get(x) })
names(listFiles) <- charList
# Bind rows, and add the descriptive locale name as sourceName
tblFiles <- bind_rows(listFiles, .id="source") %>%
mutate(sourceName=mapper[source])
tblFiles
}
The 2016 data will be used to run the combined process:
# Grab all the data that ends in _2016
locales2016 <- ls() %>%
grep(pattern="_2016", value=TRUE)
cat("\nLocales used will be:\n\n", paste0(locales2016, collapse="\n"), "\n\n", sep="")
##
## Locales used will be:
##
## kdtw_2016
## kewr_2016
## kgrb_2016
## kgrr_2016
## kiah_2016
## kind_2016
## klas_2016
## klnk_2016
## kmke_2016
## kmsn_2016
## kmsp_2016
## kmsy_2016
## kord_2016
## ksan_2016
## ktvc_2016
# Combine the 2016 data
all2016Data <- combineProcessedFiles(locales2016)
# Show counts by sourceName
all2016Data %>%
count(source, sourceName)
## # A tibble: 15 x 3
## source sourceName n
## <chr> <chr> <int>
## 1 kdtw_2016 Detroit, MI (2016) 8818
## 2 kewr_2016 Newark, NJ (2016) 8821
## 3 kgrb_2016 Green Bay, WI (2016) 8803
## 4 kgrr_2016 Grand Rapids, MI (2016) 8812
## 5 kiah_2016 Houston, TX (2016) 8816
## 6 kind_2016 Indianapolis, IN (2016) 8767
## 7 klas_2016 Las Vegas, NV (2016) 8818
## 8 klnk_2016 Lincoln, NE (2016) 8813
## 9 kmke_2016 Milwaukee, WI (2016) 8808
## 10 kmsn_2016 Madison, WI (2016) 8798
## 11 kmsp_2016 Minneapolis, MN (2016) 8817
## 12 kmsy_2016 New Orleans, LA (2016) 8813
## 13 kord_2016 Chicago, IL (2016) 8815
## 14 ksan_2016 San Diego, CA (2016) 8810
## 15 ktvc_2016 Traverse City, MI (2016) 8814
The following global summaries will be useful:
Helper functions can be created for:
The function helperFactorNumeric is created to apply function f to numeric variable y by factor variable x:
# Helper function to get overall percentage by metric for variable x
helperCountsByMetric <- function(tbl, ctVar, sumOn="dummyVar") {
tbl %>%
mutate(dummyVar=1) %>%
select_at(vars(all_of(c(ctVar, sumOn)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(ctVar) %>%
summarize(n=sum(get(sumOn))) %>%
mutate(nPct=n/sum(n))
}
# Example run to get counts by greatest sky obscuration
helperCountsByMetric(all2016Data, ctVar="wType")
## # A tibble: 7 x 3
## wType n nPct
## <fct> <dbl> <dbl>
## 1 VV 761 0.00576
## 2 OVC 39480 0.299
## 3 BKN 31352 0.237
## 4 SCT 16212 0.123
## 5 FEW 19304 0.146
## 6 CLR 24433 0.185
## 7 Error 601 0.00455
# Helper function to get a geom_smooth for variable y vs variable x
helperNumCor <- function(tbl,
xVar,
yVar,
sumOn="dummyVar",
se=TRUE,
color="red",
method="lm",
lty=2
) {
# Generate the overall totals for sumOn by xVar and yVar
plotData <- tbl %>%
mutate(dummyVar=1) %>%
select_at(vars(all_of(c(xVar, yVar, sumOn)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(vars(all_of(c(xVar, yVar)))) %>%
summarize(nTotal=sum(get(sumOn)))
geom_smooth(data=plotData,
aes_string(x=xVar, y=yVar, weight="nTotal"),
se=se,
color=color,
method=method,
lty=lty
)
}
# Example run to get TempC vs DewC
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
## mapping: weight = ~nTotal, x = ~TempC, y = ~DewC
## geom_smooth: na.rm = FALSE, se = TRUE
## stat_smooth: na.rm = FALSE, se = TRUE, method = lm, formula = y ~ x
## position_identity
# Example for using the helper function on a plot
plotNumCor(kdtw_2016, var1="TempC", var2="DewC") +
helperNumCor(all2016Data, xVar="TempC", yVar="DewC")
# Helper function to calculate .f(numVar) by byVar
helperFactorNumeric <- function(tbl, .f, byVar, numVar) {
tbl %>%
select_at(vars(all_of(c(byVar, numVar)))) %>%
filter_all(all_vars(!is.na(.))) %>%
group_by_at(byVar) %>%
summarize(helpFN=.f(get(numVar)))
}
# Example for getting median TempF by month
helperFactorNumeric(all2016Data, .f=median, byVar="month", numVar="TempF")
## # A tibble: 12 x 2
## month helpFN
## <fct> <dbl>
## 1 Jan 30.9
## 2 Feb 35.1
## 3 Mar 46.9
## 4 Apr 52.0
## 5 May 64.0
## 6 Jun 73.0
## 7 Jul 77
## 8 Aug 75.9
## 9 Sep 71.1
## 10 Oct 61.0
## 11 Nov 50
## 12 Dec 34.0
The function plotCountsByMetric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Previous Example for Detroit 2016 - using WindDir, cType1, month, wType
plotcountsByMetric(kdtw_2016,
mets=c("WindDir", "cType1", "month", "wType"),
title="Detroit, MI (2016)",
dropNA=TRUE,
diagnose=TRUE
)
##
## Dropping 1 rows with 49 observations due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
plotcountsByMetric(useData,
mets=c("WindDir", "cType1", "month", "wType"),
title="Comparison Across Locales (red dots are the median)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
As observed previously, Las Vegas tends towards southerly winds while San Diego tends towards northwesterly winds and calm (direction 000) winds. Detroit is most likely to be overcast, while Las Vegas is most likely to be clear.
The function plotNumCor() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Example for Detroit 2016 - using TempC and DewC
plotNumCor(kdtw_2016, var1="TempC", var2="DewC", subT="Detroit, MI (2016)", diagnose=TRUE)
##
## Dropping 1 rows with 49 observations due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
# Facetted plot for very highly correlated variables TempC and TempF
plotNumCor(useData,
var1="TempC",
var2="TempF",
subT="Comparison Across Locales (red dashed lines are the overall)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
# Facetted plot for highly correlated variables TempF and DewF
plotNumCor(useData,
var1="TempF",
var2="DewF",
subT="Comparison Across Locales (red dashed lines are the overall)",
dropNA=TRUE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Dropping 4 rows with 117 observations due to NA
The differences in the relationships between temperature and dew point stand out:
In contrast, of course, temperatures measured in C and F all follow the same pattern regardless of city
The function plotFactorNumeric() has been updated above to allow for facetting and plotting of the overall central tendency. Two examples are shown - the base from previous, and a facetted example:
# Example for Detroit 2016 - using TempF and month
plotFactorNumeric(kdtw_2016,
fctVar="month",
numVar="TempF",
subT="Detroit, MI (2016)",
showXLabel=FALSE,
diagnose=TRUE
)
##
## Removing 49 records due to NA
# Facetted example for kdtw_2016, kord_2016, klas_2016, ksan_2016
useData <- all2016Data %>%
filter(source %in% c("kdtw_2016", "klas_2016", "kord_2016", "ksan_2016"))
plotFactorNumeric(useData,
fctVar="month",
numVar="TempF",
subT="Overall median shown as red dot",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
##
## Removing 117 records due to NA
Las Vegas consistently runs above the overall median temperature, while Chicago and Detroit run below the overall median temperature, particularly during the cold season. San Diego has little temperature variation by month and is thus below the median in the warm season and above the median in the cold season.
It is now possible to re-run the EDA plotting routines, focused on the 2016 data:
# Create rounded TempF and DewF in all2016Data
all2016Data <- all2016Data %>%
mutate(TempF5=5*round(round(TempF)/5),
DewF5=5*round(round(DewF)/5),
WindSpeed5=5*round(WindSpeed/5),
Altimeter10=round(Altimeter, 1)
)
# Counts by Metric for all 2016 data
plotcountsByMetric(all2016Data,
mets=c("month", "year",
"WindDir", "WindSpeed5",
"Visibility", "Altimeter10",
"TempF5", "DewF5",
"wType"
),
title="Comparisons Across Locales (red dots are the median)",
facetOn="sourceName",
showCentral=TRUE
)
The cross-locale comparisons bring out a few salient features:
DATA VOLUMES:
* All locales have roughly the same amount of data by year and month, focused on 2016 and with 1-2 days on either side
WIND DIRECTION and WIND SPEED:
* Las Vegas has an excess of no/variable wind and of southerly winds, both appropriate for a desert
* Houston has an excess of no wind and southerly winds, both appropriate for the Gulf Coast
* San Diego has an excess of no wind and of northwesterly wind, both appropriate for the Pacific coast
* Chicago, Grand Rapids, Indianapolis, Minneapolis, and Newark all have lower occurences of no wind, appropriate for relatively cold mid-latitude cities
* Lincoln looks “about normal”, with the exception that it has slightly more southerly winds; Lincoln is the only Great Plains locale in the analysis, and it appears to be a blend of Gulf Coast and Wintry
* Detroit, Green Bay, Milwaukee, and New Orleans all look “about average”; this is not surprising in the first three cases given the predominance of cold, mid-latitude locales, but is surprising for New Orleans
* Madison and Traverse City are surprising in that both show a predominance of no/variable winds; this is uncommon for cold-weather cities in the mid-latitude and merits further examination
VISIBILITY:
* The overwhelming majority of visibilities are 10SM (the highest that is recorded; more or less means unlimited in the METAR)
* There is a data issue with a Visibility > 10 that should be addressed
* Las Vegas is slightly more likely than most to have unlimited Visibility and Detroit is slightly less likely than most to have unlimited visibility
ALTIMETER:
* Las Vegas skews low as appropriate for a high-altitude desert locale
* New Orleans and Houston show less variance, perhaps driven by being roughly at sea level and in close proximity to the Gulf of Mexico
* San Diego shows very low variance, perhaps driven by being roughly at sea level and in close proximity to the Pacific Ocean
TEMPERATURE:
* Houston, Las Vegas, and New Orleans skew warm as expected
* San Diego has ver low variance as expected
* At a gross level, the other cities look similar to the median, likely driven by the predominance of cold, mid-latitude locales in the data file
DEW POINT:
* Houston and New Orleans skew very high as expected
* Las Vegas skews very low as expected
* San Diego has very low variance as expected
SKY OBSCURATION:
* Lincoln and Green Bay are the most likely to be CLR (clear, no clouds on the automated sensor). This may be driven by a difference in maximum sensor heights, and is unexpected in Green Bay which should be frequently cloudy due to its latitude and proximity to a large body of water
* Detroit, Traverse City, Grand Rapids, and Minneapolis are especially likely to be overcast
* Las Vegas is especially likely to have few clouds or to be clear
Comparisons are run for a few of the numerical correlations:
# Example for 2016 - using mixes of WindSpeed, Altimeter, TempF, DewF, TempC, DewC
numCorList <- list(c("TempC", "TempF"),
c("DewC", "DewF"),
c("TempF", "DewF"),
c("Altimeter", "WindSpeed"),
c("Altimeter", "TempF")
)
# Run the list through plotNumCor()
for (x in numCorList) {
plotNumCor(all2016Data,
var1=x[1],
var2=x[2],
subT="Red dashed line is the overall slope",
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
}
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
##
## Dropping 15 rows with 593 observations due to NA
The cross-locale comparisons bring out a few salient features:
FAHRENHEIT AND CELSIUS:
* As expected, TempF/C are perfectly correlated and DewF/C are perfectly correlated. Since the observations were taken in the US, the TempF/DewF data will be used (TempC/DewC are conversions from the measured TempF/DewF to match the international standard for METAR reporting)
TEMPERATURE AND DEW POINT:
* While many cities have different clusters of temperature/dew point, all but Las Vegas and San Diego follow a pattern where the temperature and the dew point tend to run together at a similar rate
* In Las Vegas, the dew point is largely independent of the temperature
* In San Diego, there is less variance in temperature and dew point, and a lower (but still obvious) tendency for tenmperature and dew point to rise/fall together
ALTIMETER AND WIND SPEED:
* As expected, when the altimeter rises, on average, the wind speed falls
* This tendency is less pronounced in Houston and New Orleans; and more pronounced in Las Vegas, Dan Diego, and Indianapolis
ALTIMETER AND TEMPERATURE:
* Overall, low temperatures and high altimeters tend to be observed together
* This is especially so in Houston, Las Vegas, and New Orleans
* This is more modest in Grand Rapids, Green Bay, and Traverse City
Comparisons are then run for numeric variables against factor variables:
# Modify windDir so that it is just N, NE, E, SE, S, SW, W, NW, 000, Variable
mod2016Data <- all2016Data %>%
mutate(tempDir=ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1, as.numeric(WindDir)),
predomDir=factor(case_when(is.na(WindDir) ~ "Error",
WindDir=="000" ~ "000",
WindDir=="VRB" ~ "VRB",
tempDir >= 337.5 ~ "N",
tempDir <= 22.5 ~ "N",
tempDir <= 67.5 ~ "NE",
tempDir <= 112.5 ~ "E",
tempDir <= 157.5 ~ "SE",
tempDir <= 202.5 ~ "S",
tempDir <= 247.5 ~ "SW",
tempDir <= 292.5 ~ "W",
tempDir <= 337.5 ~ "NW",
TRUE ~ "Error"
),
levels=c("Error", "000", "VRB", "NE", "E", "SE", "S", "SW", "W", "NW", "N")
)
)
## Warning in ifelse(is.na(WindDir) | WindDir %in% c("000", "VRB"), -1,
## as.numeric(WindDir)): NAs introduced by coercion
# Key factor variables include month, wType, predomDir
# Key numeric variables include WindSpeed, Altimeter, TempF, DewF, Visibility
fctNumList <- list(c("month", "WindSpeed"),
c("month", "Altimeter"),
c("month", "TempF"),
c("month", "DewF"),
c("month", "Visibility"),
c("wType", "WindSpeed"),
c("wType", "Altimeter"),
c("wType", "Visibility"),
c("predomDir", "WindSpeed"),
c("predomDir", "Altimeter"),
c("predomDir", "TempF"),
c("predomDir", "DewF")
)
for (x in fctNumList) {
plotFactorNumeric(mod2016Data,
fctVar=x[1],
numVar=x[2],
subT="Red dots are the overall average",
showXLabel=FALSE,
diagnose=TRUE,
facetOn="sourceName",
showCentral=TRUE
)
}
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
##
## Removing 593 records due to NA
The cross-locale comparisons bring out a few salient points:
WIND SPEED BY MONTH:
* The plot is too busy; need to rethink, potentially by removing the outliers and plotting only the box
ALTIMETER BY MONTH:
* Same as above; this plot is too busy
TEMPERATURE BY MONTH:
* Seasonal patterns are observed in most of the data, with a warm season centered around July and a cold season centered around December
* Las Vegas stand out for running warmer than the overall average in every month
* Houston, New Orleans, and San Diego stand out for running warmer than the overall average in the cold season and similar to the overall average in the warm season
DEW POINT BY MONTH:
* Seasonal patterns are observed in most of the data, with the humid seasons tracking with the warm seasons
* Houston and New Orleans run consistently above the average dew point by month
* San Diego runs above the average dew point during the cold season
* Las Vegas runs below the average dew point during the warm season
VISIBILITY BY MONTH:
* Visibilities are overwhelmingly likely to be 10 SM
* The Newark, NJ outlier described earlier (19 SM) should be deleted
* Detroit is especially likely to have Visibility less than 10 SM
* Grand Rapids and Traverse City have meaningul occurences of Visibility less than 10 SM during the cold season
WIND SPEED BY SKY OBSCURATION:
* This chart is too busy as per above
ALTIMETER BY SKY OBSCURATION:
* Not much at a glance
VISIBILITY BY SKY OBSCURATION:
* The VV and OVC sky obscurations are most associated with low visibilities; VV in particular is almost always associated with very low visibility
WIND SPEED BY WIND DIRECTION:
* Wind direction “000” is always associated with wind speed 0, as expected
* Wind direction “VRB” is always associated with a low but non-zero wind speed, as expected
* While some cities are windier than others, there is no pronounced tendency for wind speed to be highly associated with a given wind direction in any locale
ALTIMETER BY WIND DIRECTION:
* No gross trends observed
* The plot is rather busy, especially given the low variance in median/IQR for altimeter relative to the outlier points
TEMPERATURE/DEW POINT BY WIND DIRECTION:
* Plot is not great for reading and interpreting
Next steps are to modify a few of the plots for better interpretability (less busy, more variation of the core data metric relative to the full y-axis, etc.), investigate and rectify the data issues observed, and save a version of the file for further analysis.